ABSTRACT OF THE DISCLOSURE 

An apparatus and method for determining if a query document 
matches one or more of a plurality of documents in a database. In a coarse 
matching stage, a compressed file or other query document is scanned to 
produce a bit profile. Global statistics such as line spacing and text height are 
calculated from the bit profile and used to narrow the field of documents to 
be searched in an image database. The bit profile is cross-correlated with bit 
profiles of documents in the search space to identify candidates for a detailed 
matching stage. If multiple candidates are generated in the coarse matching 
stage, a set of endpoint features is extracted from the query document for 
detailed matching in the detailed matching stage. Endpoint features contain 
sufficient information for various levels of processing, including page skew 
and orientation estimation. In addition, endpoint features are stable, 
symmetric and easily computable from commonly used compressed files 
including, but not limited to, CCITT Group 4 compressed files. Endpoint 
features extracted in the detailed matching stage are used to correctly identify 
a matching document in a high percentage of cases. 
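
The coarse matching stage described above can be illustrated with a minimal sketch. It assumes that a bit profile is simply a sequence of per-row numeric values, and that candidates are selected by normalized cross-correlation over a small range of vertical shifts. The function names, the threshold, the lag range, and the minimum-overlap rule are all hypothetical choices made for illustration. The disclosure does not specify them.

```python
from math import sqrt

def normalized_cross_correlation(a, b, lag):
    """Pearson correlation of profile a against profile b shifted by `lag` rows.

    Only the overlapping rows are compared. Returns 0.0 when the overlap is
    shorter than half of `a` (an assumed guard against spurious matches) or
    when either overlapping segment is constant.
    """
    pairs = [(a[i], b[i + lag]) for i in range(len(a)) if 0 <= i + lag < len(b)]
    if len(pairs) < max(2, len(a) // 2):
        return 0.0
    xs, ys = zip(*pairs)
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    den = sqrt(sum((x - mean_x) ** 2 for x in xs) * sum((y - mean_y) ** 2 for y in ys))
    return num / den if den else 0.0

def best_correlation(query, candidate, max_lag=2):
    """Highest correlation over vertical shifts in [-max_lag, max_lag]."""
    return max(
        normalized_cross_correlation(query, candidate, lag)
        for lag in range(-max_lag, max_lag + 1)
    )

def coarse_match(query, database, threshold=0.9, max_lag=2):
    """Return (doc_id, score) pairs whose score meets the assumed threshold,
    best first. `database` maps document ids to stored bit profiles."""
    scored = [
        (doc_id, best_correlation(query, profile, max_lag))
        for doc_id, profile in database.items()
    ]
    return sorted(
        [(doc_id, score) for doc_id, score in scored if score >= threshold],
        key=lambda item: -item[1],
    )

# Hypothetical profiles: "a" is the query shifted down by one row,
# "flat" has no row-to-row variation.
query = [10, 50, 52, 10, 48, 51, 10, 49]
database = {
    "a": [10, 10, 50, 52, 10, 48, 51, 10, 49],
    "flat": [30] * 8,
}
candidates = coarse_match(query, database)
```

If `coarse_match` returns more than one candidate, the abstract's detailed matching stage on endpoint features would decide among them. That stage is not sketched here.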